## Intel<sup>®</sup> oneAPI VTune<sup>™</sup> Profiler 2021.1.1 Gold

**Elapsed Time:** 0.045s

Application execution time is too short. Metrics data may be unreliable. Consider reducing the sampling interval or increasing your application execution time.
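To see why a 0.045 s run is considered too short, it helps to estimate how many sampling intervals fit into it. The sketch below is rough arithmetic. The interval values are hypothetical examples, not the setting used for this collection, which the report does not state.

```python
# Rough count of sampling intervals that fit in a short run.
# ASSUMPTION: the intervals below are hypothetical examples; the report
# does not state which sampling interval this collection used.

def expected_samples(elapsed_s: float, interval_ms: float) -> float:
    """Return how many sampling intervals of `interval_ms` fit in `elapsed_s`."""
    return elapsed_s / (interval_ms / 1000.0)


ELAPSED_S = 0.045  # Elapsed Time from the report, in seconds

for interval_ms in (1.0, 0.1):  # hypothetical intervals, in milliseconds
    n = expected_samples(ELAPSED_S, interval_ms)
    print(f"interval {interval_ms} ms -> about {n:.0f} samples")
```

Shrinking the interval tenfold yields roughly ten times as many samples. This is why the report suggests a smaller interval or a longer run.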

Clockticks: 41,940,000

Instructions Retired: 66,960,000

CPI Rate: 0.626
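The CPI Rate is the ratio of the two counters above: clockticks divided by instructions retired. A minimal check using the report's own counts, with the reciprocal (IPC) shown for comparison:

```python
# CPI (cycles per instruction) = Clockticks / Instructions Retired.
clockticks = 41_940_000
instructions_retired = 66_960_000

cpi = clockticks / instructions_retired
ipc = 1 / cpi  # instructions per cycle, the reciprocal view of the same ratio

print(f"CPI = {cpi:.3f}, IPC = {ipc:.2f}")
```

A CPI below 1 means that, on average, more than one instruction retired per clocktick during the sampled interval.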

MUX Reliability: 0.626

**Retiring:** 30.3% of Pipeline Slots

Light Operations: 34.2% of Pipeline Slots

FP Arithmetic:
FP x87: 0.0% of uOps
FP Scalar: 0.0% of uOps
0.0% of uOps
0.0% of uOps
0.0% of uOps
100.0% of uOps
0.0% of uOps
0.0% of Pipeline Slots

Microcode Sequencer: 7.0% of Pipeline Slots
Assists: 0.0% of Pipeline Slots

Front-End Bound:

Front-End Latency:
ICache Misses:
ITLB Overhead:
Branch Resteers:
Mispredicts Resteers:
Clears Resteers:

0.0% of Pipeline Slots
11.0% of Pipeline Slots
0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks

Unknown Branches: 0.0% of Clockticks
DSB Switches: 0.0% of Clockticks
Length Changing Prefixes: 0.0% of Clockticks
MS Switches: 0.0% of Clockticks
Front-End Bandwidth: 5.5% of Pipeline Slots
Front-End Bandwidth MITE: 33.1% of Clockticks
Front-End Bandwidth DSB: 0.0% of Clockticks

(Info) DSB Coverage: 46.2%

Bad Speculation: 5.5% of Pipeline Slots

Branch Mispredict: 0.0% of Pipeline Slots

Machine Clears: 5.5% of Pipeline Slots

Back-End Bound:

A significant portion of pipeline slots remain empty. When operations take too long in the back-end, they introduce bubbles in the pipeline. These bubbles ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable of supporting. This opportunity cost results in slower execution. Long-latency operations such as divides and memory operations can cause this. So can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).
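The percentages in this report are shares of pipeline slots. A minimal sketch of that accounting follows, under two labelled assumptions. First, that the core issues 4 uops per cycle, which is the issue width of Skylake-family cores such as Kaby Lake. Second, that Back-End Bound splits into Memory Bound plus Core Bound, as in the top-down (TMA) method that this analysis is based on. The inputs are the rounded percentages from this report.

```python
# Top-down (TMA) level-1 slot accounting sketch.
# ASSUMPTIONS: 4-wide issue (Skylake-family / Kaby Lake), and
# Back-End Bound = Memory Bound + Core Bound. Inputs are rounded report values.

PIPELINE_WIDTH = 4
clockticks = 41_940_000
total_slots = PIPELINE_WIDTH * clockticks  # issue slots available in the sample

retiring = 30.3      # % of pipeline slots
memory_bound = 36.5  # % of pipeline slots
core_bound = 11.1    # % of pipeline slots

back_end_bound = memory_bound + core_bound
# The four level-1 categories partition all slots, so the remainder is
# shared between Front-End Bound and Bad Speculation.
front_end_plus_bad_spec = 100.0 - retiring - back_end_bound


def slots(pct: float) -> float:
    """Convert a percentage of pipeline slots into an absolute slot count."""
    return total_slots * pct / 100.0


print(f"total slots          : {total_slots:,}")
print(f"Back-End Bound       : {back_end_bound:.1f}% (~{slots(back_end_bound):,.0f} slots)")
print(f"Front-End + Bad Spec : {front_end_plus_bad_spec:.1f}%")
```

These derived values only restate the report's own numbers. With rounded inputs, they can differ from VTune's internal values in the last digit.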

## **Memory Bound:** 36.5% of Pipeline Slots

The metric value is high. This can indicate that a significant fraction of execution pipeline slots could be stalled due to demand memory loads and stores. Use Memory Access analysis to break the metric down by memory hierarchy level, obtain memory bandwidth information, and correlate stalls with memory objects.

L1 Bound:

DTLB Overhead:

Load STLB Hit:

Load STLB Miss:

0.0% of Clockticks

0.0% of Clockticks

1.3% of Clockticks

1.3% of Clockticks

**Loads Blocked by Store Forwarding:** 0.0% of Clockticks

Lock Latency:
Split Loads:
4K Aliasing:
FB Full:

L2 Bound:

0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks

This metric shows how often the machine was stalled on the L2 cache. Avoiding cache misses (L1 misses that hit in L2) will reduce latency and increase performance.

L3 Bound:
Contested Accesses:
Data Sharing:
L3 Latency:
SQ Full:

DRAM Bound:

0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks

Memory Bandwidth: 0.0% of Clockticks

Memory Latency: 25.8% of Clockticks

Store Bound:
Store Latency:
False Sharing:
Split Stores:
DTLB Store Overhead:

0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks
2.2% of Clockticks

Store STLB Hit: 0.0% of Clockticks

Store STLB Miss: 2.2% of Clockticks

**Core Bound:** 11.1% of Pipeline Slots

This metric represents how much core (non-memory) issues were a bottleneck. Shortages in hardware compute resources and dependencies among the software's instructions are both categorized under Core Bound. Hence it may indicate one of three things: the machine ran out of out-of-order (OOO) resources, certain execution units are overloaded, or dependencies in the program's data or instruction flow are limiting performance (e.g., chains of long-latency floating-point arithmetic operations).

- Divider: 0.0% of Clockticks
- **Port Utilization:** 3.9% of Clockticks
- **Cycles of 0 Ports Utilized:** 11.0% of Clockticks
- **Serializing Operations:** 0.0% of Clockticks
- Mixing Vectors: 0.0% of uOps
- Cycles of 1 Port Utilized: 5.5% of Clockticks
- Cycles of 2 Ports Utilized: 5.5% of Clockticks
- Cycles of 3+ Ports Utilized: 16.6% of Clockticks
- **ALU Operation Utilization:** 27.6% of Clockticks
  - Port 0: 22.1% of Clockticks
  - Port 1: 22.1% of Clockticks
  - Port 5: 22.1% of Clockticks
  - Port 6: 44.1% of Clockticks
- **Load Operation Utilization:** 22.1% of Clockticks
  - Port 2: 33.1% of Clockticks
  - Port 3: 33.1% of Clockticks
- **Store Operation Utilization:** 22.1% of Clockticks
  - Port 4: 22.1% of Clockticks
  - Port 7: 0.0% of Clockticks

**Vector Capacity Usage (FPU):** 0.0%

**Average CPU Frequency:** 1.050 GHz
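The three operation-utilization figures appear to be aggregates of the per-port values reported above. The sketch below reproduces them under an assumption: that VTune follows the top-down (TMA) definitions for Skylake-family cores. Under those definitions, ALU utilization is the mean of ports 0, 1, 5 and 6. Load utilization is (port 2 + port 3 + port 7 − port 4) / 2, because store-address uops also run on ports 2, 3 and 7, and each store has one store-data uop on port 4. Store utilization is port 4. This is a consistency check on rounded inputs, not VTune's implementation.

```python
# Per-port values from the report, in % of clockticks.
# ASSUMPTION: aggregation follows top-down (TMA) definitions for a
# Skylake-family core; inputs are rounded, so results may differ slightly.
PORTS = {0: 22.1, 1: 22.1, 2: 33.1, 3: 33.1, 4: 22.1, 5: 22.1, 6: 44.1, 7: 0.0}


def alu_utilization(p: dict) -> float:
    # Ports 0, 1, 5 and 6 host the ALUs on this core; average their usage.
    return (p[0] + p[1] + p[5] + p[6]) / 4


def load_utilization(p: dict) -> float:
    # Ports 2, 3 and 7 also execute store-address uops; subtracting port 4
    # (one store-data uop per store) removes that share, leaving loads.
    # Two load-capable ports, so divide by 2.
    return (p[2] + p[3] + p[7] - p[4]) / 2


def store_utilization(p: dict) -> float:
    # Port 4 carries store data, one uop per store.
    return p[4]


for name, fn in (("ALU", alu_utilization),
                 ("Load", load_utilization),
                 ("Store", store_utilization)):
    print(f"{name:<6}{fn(PORTS):6.2f}%")
```

Port 6 at roughly twice the load of ports 0, 1 and 5 is what lifts the ALU average to about 27.6%.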

**Total Thread Count:** 1

**Paused Time:** 0s

## **Effective Physical Core Utilization:** 26.0% (1.042 out of 4)

The metric value is low, which may signal poor utilization of physical CPU cores caused by:

- load imbalance
- threading runtime overhead
- contended synchronization
- thread/process underutilization
- incorrect affinity that utilizes logical cores instead of physical cores

Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Locks and Waits analysis to identify parallel bottlenecks for other parallel runtimes.

**Effective Logical Core Utilization:** 11.2% (0.893 out of 8)

The metric value is low, which may signal poor utilization of logical CPU cores. Consider improving physical core utilization as a first step. Then look for opportunities to utilize logical cores, which in some cases can improve processor throughput and the overall performance of multi-threaded applications.
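Both utilization figures are simply the effective number of cores in use divided by the number available (4 physical cores, 8 logical CPUs). A minimal check with the report's counts follows. It does not model how VTune derives the "effective" core counts themselves. Because those counts are shown rounded, the last digit can differ from the reported percentages.

```python
# Effective core utilization = effective cores in use / cores available.
# The effective counts are taken from the report; their derivation is not modeled.

def utilization_pct(effective: float, available: int) -> float:
    """Percentage of available cores that were effectively used."""
    return 100.0 * effective / available


physical = utilization_pct(1.042, 4)  # 4 physical cores
logical = utilization_pct(0.893, 8)   # 8 logical CPUs (Logical CPU Count)

print(f"physical: {physical:.2f}%  logical: {logical:.2f}%")
```

With a Total Thread Count of 1, roughly one effective core is about the ceiling. This is consistent with both values being low.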

## **Collection and Platform Info:**

**Application Command Line:** ./codecs/hm/decoder/TAppDecoderStatic "-b" "./bin/hm/encoder\_randomaccess\_main.cfg/CLASS\_C/ RaceHorses 416x240 30 QP 37 hm.bin"

**User Name:** root

**Operating System:** 5.4.0-65-generic DISTRIB\_ID=Ubuntu DISTRIB\_RELEASE=18.04 DISTRIB\_CODENAME=bionic DISTRIB\_DESCRIPTION="Ubuntu 18.04.5 LTS"

**Computer Name:** eimon

**Result Size:** 9.4 MB

**Collection start time:** 09:55:32 10/02/2021 UTC

**Collection stop time:** 09:55:32 10/02/2021 UTC

**Collector Type:** Event-based sampling driver

CPU:

Name: Intel(R) Processor code named Kabylake ULX

Frequency: 1.992 GHz

**Logical CPU Count:** 8

**Cache Allocation Technology:** 

**Level 2 capability:** not detected

**Level 3 capability:** not detected